Mobile Strategy Game Analysis

Kutay Akalın
05/01/2020

1. Dataset

The dataset includes 17,007 strategy games on the Apple App Store. It was collected on 3 August 2019 using the iTunes API and the App Store sitemap. For this analysis, the data was downloaded from the Kaggle website in CSV format and uploaded to a personal GitHub page.

I chose this dataset because I have a great interest in the game app market. I want to analyse the popular sub-genres, their user ratings and their user rating counts to gain insight into user preferences. In my opinion, a game developed around these preferences is more likely to be successful.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly as py
import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
import io
import requests
import warnings
warnings.filterwarnings('ignore')

url="https://github.com/pjournal/mef03-KutayAkalin/blob/master/appstore_games.csv?raw=True"
s=requests.get(url).content
appstore_games=pd.read_csv(io.StringIO(s.decode('utf-8')))

display(appstore_games.info())
display(appstore_games.describe())
display(appstore_games.describe(include='O'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17007 entries, 0 to 17006
Data columns (total 18 columns):
URL                             17007 non-null object
ID                              17007 non-null int64
Name                            17007 non-null object
Subtitle                        5261 non-null object
Icon URL                        17007 non-null object
Average User Rating             7561 non-null float64
User Rating Count               7561 non-null float64
Price                           16983 non-null float64
In-app Purchases                7683 non-null object
Description                     17007 non-null object
Developer                       17007 non-null object
Age Rating                      17007 non-null object
Languages                       16947 non-null object
Size                            17006 non-null float64
Primary Genre                   17007 non-null object
Genres                          17007 non-null object
Original Release Date           17007 non-null object
Current Version Release Date    17007 non-null object
dtypes: float64(4), int64(1), object(13)
memory usage: 2.3+ MB
None
                 ID  Average User Rating  User Rating Count         Price          Size
count  1.700700e+04          7561.000000       7.561000e+03  16983.000000  1.700600e+04
mean   1.059614e+09             4.060905       3.306531e+03      0.813419  1.157064e+08
std    2.999676e+08             0.751428       4.232256e+04      7.835732  2.036477e+08
min    2.849214e+08             1.000000       5.000000e+00      0.000000  5.132800e+04
25%    8.996543e+08             3.500000       1.200000e+01      0.000000  2.295014e+07
50%    1.112286e+09             4.500000       4.600000e+01      0.000000  5.676895e+07
75%    1.286983e+09             4.500000       3.090000e+02      0.000000  1.330271e+08
max    1.475077e+09             5.000000       3.032734e+06    179.990000  4.005591e+09
URL Name Subtitle Icon URL In-app Purchases Description Developer Age Rating Languages Primary Genre Genres Original Release Date Current Version Release Date
count 17007 17007 5261 17007 7683 17007 17007 17007 16947 17007 17007 17007 17007
unique 16847 16847 5010 16847 3803 16473 8693 4 990 21 1004 3084 2512
top https://apps.apple.com/us/app/angry-car-parkin... Majesty: The Fantasy Kingdom Sim - Free Emoji Stickers https://is4-ssl.mzstatic.com/image/thumb/Purpl... 0.99 #NAME? Tapps Tecnologia da Informação Ltda. 4+ EN Games Games, Strategy, Puzzle 2/09/2016 1/08/2019
freq 2 2 14 2 943 17 123 11806 12467 16286 778 75 118

We see that some columns contain null (NaN) values. We can count the missing values per column with:

In [2]:
appstore_games.isna().sum()
Out[2]:
URL                                 0
ID                                  0
Name                                0
Subtitle                        11746
Icon URL                            0
Average User Rating              9446
User Rating Count                9446
Price                              24
In-app Purchases                 9324
Description                         0
Developer                           0
Age Rating                          0
Languages                          60
Size                                1
Primary Genre                       0
Genres                              0
Original Release Date               0
Current Version Release Date        0
dtype: int64
In [3]:
msno.matrix(appstore_games)
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d13606e940>
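
The raw counts above are easier to compare as shares of the 17007 rows: for example, 9446 / 17007 is roughly 55.5% of games without a rating, and 11746 / 17007 is roughly 69.1% without a subtitle. A minimal sketch of that calculation, assuming a small hypothetical frame `toy` that stands in for `appstore_games` (only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for appstore_games; the values are made up.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Subtitle": [np.nan, "x", np.nan, np.nan],
    "Average User Rating": [4.0, np.nan, np.nan, 3.5],
})

# isna() gives a boolean frame; the mean of each column is its share of missing cells.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `appstore_games` should give the per-column shares directly, which helps decide which columns can be filled and which rows have to be dropped.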

2. Data Preprocessing

First, we need to fill or delete the empty values. For apps with fewer than five user ratings, both 'User Rating Count' and 'Average User Rating' are NaN, so these rows need to be dropped before further analysis. In addition, I add a "Free_or_Paid" column.

In [4]:
#Dropping unnecessary columns
gameanalyse = appstore_games.copy()
gameanalyse = gameanalyse.drop(columns="URL")
gameanalyse = gameanalyse.drop(columns="Icon URL")
gameanalyse = gameanalyse.drop(columns="ID")


#Dropping User Rating NaN Values
gameanalyse = gameanalyse[pd.notnull(gameanalyse['User Rating Count'])]
len(gameanalyse)

# Converting date columns from object to datetime
gameanalyse['Original Release Date'] = pd.to_datetime(gameanalyse['Original Release Date'], format = '%d/%m/%Y')
gameanalyse['Current Version Release Date'] = pd.to_datetime(gameanalyse['Current Version Release Date'], format = '%d/%m/%Y')

#Adding Free - Paid column (any price other than 0.0, including NaN, is labelled 'Paid')
gameanalyse['Free_or_Paid'] = np.where(gameanalyse['Price'] == 0.0, 'Free', 'Paid')
    
gameanalyse=gameanalyse.reset_index()
gameanalyse.head()
Out[4]:
index Name Subtitle Average User Rating User Rating Count Price In-app Purchases Description Developer Age Rating Languages Size Primary Genre Genres Original Release Date Current Version Release Date Free_or_Paid
0 0 Sudoku NaN 4.0 3553.0 2.99 NaN Join over 21,000,000 of our fans and download ... Mighty Mighty Good Games 4+ DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT... 15853568.0 Games Games, Strategy, Puzzle 2008-07-11 2017-05-30 Paid
1 1 Reversi NaN 3.5 284.0 1.99 NaN The classic game of Reversi, also known as Oth... Kiss The Machine 4+ EN 12328960.0 Games Games, Strategy, Board 2008-07-11 2018-05-17 Paid
2 2 Morocco NaN 3.0 8376.0 0.00 NaN Play the classic strategy game Othello (also k... Bayou Games 4+ EN 674816.0 Games Games, Board, Strategy 2008-07-11 2017-09-05 Free
3 3 Sudoku (Free) NaN 3.5 190394.0 0.00 NaN Top 100 free app for over a year.\nRated "Best... Mighty Mighty Good Games 4+ DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT... 21552128.0 Games Games, Strategy, Puzzle 2008-07-23 2017-05-30 Free
4 4 Senet Deluxe NaN 3.5 28.0 2.99 NaN "Senet Deluxe - The Ancient Game of Life and A... RoGame Software 4+ DA, NL, EN, FR, DE, EL, IT, JA, KO, NO, PT, RU... 34689024.0 Games Games, Strategy, Board, Education 2008-07-18 2018-07-22 Paid

Genre Data Analysis

For the genre analysis, I split the "Genres" column and extracted the sub-genres. After that, I grouped them into the main sub-genres, which are shown in the "Genre" column.

In [5]:
#Grouping Genres
game_genres=gameanalyse.copy()
game_genres=game_genres[game_genres['Primary Genre']=='Games']

game_genres['Genre'] = game_genres['Genres'].str.replace(',', '').str.replace('Games', '').str.replace('Entertainment', '').str.replace('Strategy', '')
game_genres['Genre'] = game_genres['Genre'].str.split(' ').map(lambda x: ' '.join(sorted(x)))
game_genres['Genre'] = game_genres['Genre'].str.strip()

#For empty genre rows (means that has no sub-genre), it is filled with 'General'
index = game_genres.index[game_genres['Genre']==""].tolist()
game_genres.loc[index,'Genre'] = 'General'
game_genres['Genre']

#After analysing the genre distributions, combined genres are assigned to a single main genre (earlier rules take precedence).
game_genres.loc[game_genres['Genre'].str.contains('Puzzle'),'Genre'] = 'Puzzle'
game_genres.loc[game_genres['Genre'].str.contains('Simulation'),'Genre'] = 'Simulation'
game_genres.loc[game_genres['Genre'].str.contains('Action'),'Genre'] = 'Action'
game_genres.loc[game_genres['Genre'].str.contains('Board'),'Genre'] = 'Board'
game_genres.loc[np.logical_and(game_genres['Genre'].str.contains('Role'),game_genres['Genre'].str.contains('Playing')),'Genre'] = 'Role Playing'
game_genres.loc[game_genres['Genre'].str.contains('Casual'),'Genre'] = 'Casual'
game_genres.loc[game_genres['Genre'].str.contains('Card'),'Genre'] = 'Card'
game_genres.loc[game_genres['Genre'].str.contains('Adventure'),'Genre'] = 'Adventure'
game_genres.loc[game_genres['Genre'].str.contains('Sports'),'Genre'] = 'Sports'
game_genres.loc[game_genres['Genre'].str.contains('Family'),'Genre'] = 'Family'
game_genres.loc[game_genres['Genre'].str.contains('Education'),'Genre'] = 'Education'
game_genres.loc[game_genres['Genre'].str.contains('Word'),'Genre'] = 'Word'
game_genres.loc[game_genres['Genre'].str.contains('Music'),'Genre'] = 'Music'
game_genres.loc[game_genres['Genre'].str.contains('Trivia'),'Genre'] = 'Trivia'

#Re-indexing and selecting the necessary columns for further analysis.

game_genres=game_genres.reset_index()
game_genres=game_genres.drop(columns=['level_0','Primary Genre','Genres'])
game_genres.head()
Out[5]:
index Name Subtitle Average User Rating User Rating Count Price In-app Purchases Description Developer Age Rating Languages Size Original Release Date Current Version Release Date Free_or_Paid Genre
0 0 Sudoku NaN 4.0 3553.0 2.99 NaN Join over 21,000,000 of our fans and download ... Mighty Mighty Good Games 4+ DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT... 15853568.0 2008-07-11 2017-05-30 Paid Puzzle
1 1 Reversi NaN 3.5 284.0 1.99 NaN The classic game of Reversi, also known as Oth... Kiss The Machine 4+ EN 12328960.0 2008-07-11 2018-05-17 Paid Board
2 2 Morocco NaN 3.0 8376.0 0.00 NaN Play the classic strategy game Othello (also k... Bayou Games 4+ EN 674816.0 2008-07-11 2017-09-05 Free Board
3 3 Sudoku (Free) NaN 3.5 190394.0 0.00 NaN Top 100 free app for over a year.\nRated "Best... Mighty Mighty Good Games 4+ DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT... 21552128.0 2008-07-23 2017-05-30 Free Puzzle
4 4 Senet Deluxe NaN 3.5 28.0 2.99 NaN "Senet Deluxe - The Ancient Game of Life and A... RoGame Software 4+ DA, NL, EN, FR, DE, EL, IT, JA, KO, NO, PT, RU... 34689024.0 2008-07-18 2018-07-22 Paid Board
In [6]:
len(game_genres['index'])
Out[6]:
7291
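
The cell above splits the cleaned genre string on spaces and sorts the words, which is why "Role Playing" has to be matched as two separate words. As an alternative, the sub-genres can be extracted by splitting on the ", " separator and giving each label its own row. This is only a sketch on hypothetical rows, not the method used in this notebook; the frame `toy` and the set `generic` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows; the "Genres" strings follow the format seen in the dataset.
toy = pd.DataFrame({
    "Name": ["Sudoku", "Reversi", "Plain"],
    "Genres": ["Games, Strategy, Puzzle", "Games, Strategy, Board", "Games, Strategy"],
})

# Labels that carry no sub-genre information (assumed list, mirroring the cell above).
generic = {"Games", "Strategy", "Entertainment"}

# Split on ", " so multi-word labels stay intact, then put every label on its own row.
long_form = toy.assign(Genre=toy["Genres"].str.split(", ")).explode("Genre")
long_form = long_form[~long_form["Genre"].isin(generic)]

# Collect the remaining labels per game; games with none fall into 'General'.
sub_genres = long_form.groupby("Name")["Genre"].agg(list).reindex(toy["Name"])
sub_genres = sub_genres.apply(lambda g: g if isinstance(g, list) else ["General"])
print(sub_genres)
```

With this layout a game that has several sub-genres keeps all of them, so the choice of a single main genre can be made explicitly afterwards.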

Sub-genres with more than 20 games are displayed below:

In [7]:
popular_genres = game_genres.groupby("Genre").agg('count').sort_values(by=['Name'],ascending=False)
popular_genres = popular_genres[popular_genres['Name']>20].loc[:,'Name'].reset_index()
popular_genres.rename(columns={'Name':'Count'},inplace=True)
popular_genres
Out[7]:
Genre Count
0 Puzzle 1337
1 Simulation 1222
2 Action 999
3 Board 829
4 Role Playing 778
5 Casual 515
6 Card 343
7 Adventure 340
8 Family 281
9 General 274
10 Sports 124
11 Trivia 61
12 Racing 43
13 Word 40
14 Education 33
15 Music 30
16 Casino 24

3. Descriptive Statistics

Average User Rating Bar Chart

In [8]:
plt.figure(figsize=(15, 7))
plt1=sns.countplot(x="Average User Rating", data=gameanalyse,palette="rocket")
plt1.set_ylabel('Frequency', fontsize = 20)
plt1.set_xlabel('Average User Rating', fontsize = 20)
plt.show()

Age Rating Bar Chart

In [9]:
plt.figure(figsize=(15, 7))
plt1=sns.countplot(x="Age Rating", data=gameanalyse)
plt1.set_ylabel('Frequency', fontsize = 20)
plt1.set_xlabel('Age Rating', fontsize = 20)
plt.show()

WordCloud of Names and Subtitles

In [10]:
from wordcloud import WordCloud
fig, ax = plt.subplots(1, 2, figsize=(16,16))
wordcloud = WordCloud(background_color='white',width=800, height=800).generate(' '.join(gameanalyse['Name']))
wordcloud_sub = WordCloud(background_color='white',width=800, height=800).generate(' '.join(gameanalyse['Subtitle'].dropna().astype(str)) )
ax[0].imshow(wordcloud)
ax[0].axis('off')
ax[0].set_title('Wordcloud(Name)')
ax[1].imshow(wordcloud_sub)
ax[1].axis('off')
ax[1].set_title('Wordcloud(Subtitle)')
plt.show()

Distribution of App Prices

In [11]:
x = gameanalyse['Price']
x =x[np.logical_not(np.isnan(x))]

plt.figure(figsize=(16, 8))
sns.kdeplot(x,shade = True, linewidth = 5)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d135aeb4a8>
In [12]:
gameanalyse['Price'].describe()
Out[12]:
count    7561.000000
mean        0.571305
std         2.415658
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       139.990000
Name: Price, dtype: float64

User Rating Count Distribution

In [13]:
x = appstore_games['User Rating Count']
x = x[np.logical_not(np.isnan(x))]

plt.figure(figsize=(16, 8))
sns.kdeplot(x,shade = True, linewidth = 5)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d135af7a58>
In [14]:
gameanalyse['User Rating Count'].describe()
Out[14]:
count    7.561000e+03
mean     3.306531e+03
std      4.232256e+04
min      5.000000e+00
25%      1.200000e+01
50%      4.600000e+01
75%      3.090000e+02
max      3.032734e+06
Name: User Rating Count, dtype: float64

Size

In [15]:
fig = px.box(gameanalyse, y="Size")
fig.update_layout(
    title={
        'text': "Size Box Plot",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()
In [16]:
gameanalyse['Size'].describe()
Out[16]:
count    7.561000e+03
mean     1.514679e+08
std      2.550380e+08
min      2.158400e+05
25%      3.061862e+07
50%      7.964672e+07
75%      1.776138e+08
max      4.005591e+09
Name: Size, dtype: float64

As the box plot shows, the size distribution contains many outliers. We can get better insight by removing them.

In [17]:
q75, q25 = np.percentile(gameanalyse['Size'], [75 ,25])
iqr = q75 - q25
upper = q75 + 1.5*iqr
lower = q25 - 1.5*iqr

size_analyse = gameanalyse[pd.notnull(gameanalyse['Size'])]
size_analyse=size_analyse[np.logical_and(size_analyse['Size']<upper,size_analyse['Size']>lower)]
size_analyse

plt.figure(figsize=(16, 8))
sns.kdeplot(size_analyse['Size'],shade = True,)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d139629c50>
In [18]:
size_analyse['Size'].describe()
Out[18]:
count    7.002000e+03
mean     1.000393e+08
std      9.029810e+07
min      2.158400e+05
25%      2.835174e+07
50%      7.021466e+07
75%      1.496440e+08
max      3.973018e+08
Name: Size, dtype: float64
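
The filter in In [17] is the usual 1.5 x IQR rule. With the quartiles reported in Out[16], the upper fence is about 3.98e8 bytes and the lower fence is negative, so only very large apps are dropped. The sketch below repeats the rule on hypothetical sizes; the helper name `iqr_bounds` and the sample values are assumptions for illustration.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences; k=1.5 matches the multiplier used in In [17]."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Hypothetical sizes in bytes, with one very large app.
sizes = np.array([10e6, 20e6, 30e6, 40e6, 50e6, 4000e6])
lower, upper = iqr_bounds(sizes)

# Keep only the values strictly inside the fences.
kept = sizes[(sizes > lower) & (sizes < upper)]
print(lower, upper, kept.size)
```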

Age and Average User Rating

In [19]:
ax = sns.FacetGrid(gameanalyse, col="Age Rating", col_wrap=2, height=6, aspect=2,  sharey=False)
ax.map(sns.countplot, 'Average User Rating', alpha = 0.7, linewidth=4, edgecolor= 'black')
plt.subplots_adjust(hspace=0.45)
plt.show()

Game Genre Analysis

In [20]:
fig = px.bar(popular_genres, x='Genre', y='Count')
fig.update_layout(
    title={
        'text': "Bar Graph of Genres",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

For the pair plot analysis, the five most popular genres are selected.

In [21]:
sns.set(style="ticks")
popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]
game_genres_ds = game_genres.loc[:,["Average User Rating","User Rating Count","Price","Age Rating","Size","Free_or_Paid","Genre"]]

game_genres_ds = game_genres_ds[game_genres_ds.Genre.isin(popular_genres)]

plt.figure(figsize=(16, 8))
sns.set(style="ticks",color_codes=True)
sns.pairplot(game_genres_ds, hue="Genre")
Out[21]:
<seaborn.axisgrid.PairGrid at 0x1d13a659be0>
<Figure size 1152x576 with 0 Axes>
In [22]:
#Popular Genre Ratings
ax = sns.FacetGrid(game_genres_ds, col="Genre", col_wrap=2, height=6, aspect=2,  sharey=False)
ax.map(sns.countplot, 'Average User Rating', alpha = 0.7, linewidth=4, edgecolor= 'black')
plt.subplots_adjust(hspace=0.45)
plt.show()
In [23]:
fig = px.box(game_genres_ds,x="Genre", y="Size")
fig.update_layout(
    title={
        'text': "Genre and Size Box Plot",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

Time Series Analysis of Size

In [24]:
temp_df = gameanalyse.groupby(['Original Release Date']).Size.sum().reset_index()
fig = px.line(temp_df, x='Original Release Date', y='Size')
fig.show()
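
The daily sum plotted above mixes two effects: how large each game is and how many games were released on that day. A sketch that separates them by taking the mean size and the number of releases per year is given below. The frame `toy` is hypothetical; only the column names follow `gameanalyse`.

```python
import pandas as pd

# Hypothetical rows; the dates and sizes are made up.
toy = pd.DataFrame({
    "Original Release Date": pd.to_datetime(
        ["2015-03-01", "2015-09-10", "2016-01-05", "2016-06-20", "2016-11-02"]
    ),
    "Size": [50e6, 70e6, 90e6, 110e6, 130e6],
})

# Group by release year; report the mean size next to the number of releases.
yearly = (
    toy.groupby(toy["Original Release Date"].dt.year)["Size"]
    .agg(["mean", "count"])
)
print(yearly)
```

If the yearly mean rises while the count is taken into account separately, the growth can be attributed to the games themselves rather than to a larger number of releases.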

4. Inferential Statistics

Correlation Heatmap

Correlation heatmap between Average User Rating, Price, User Rating Count and Size columns:

In [25]:
game_genres_cor = gameanalyse.copy()
game_genres_cor = game_genres_cor.drop(columns=["index"])

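# Keep numeric columns only; newer pandas versions raise an error on text columns
game_genres_cor = game_genres_cor.select_dtypes(include='number').corr(method='pearson')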

plt.figure(figsize=(14, 14))
ax = sns.heatmap(
    game_genres_cor, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True, annot = True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)
ax.set_ylim(len(game_genres_cor)+0.5, -0.5);

Chi-Square

Average User Rating and Genre

Null Hypothesis:
H0: Average User Rating and Genre types are independent.

Alternate Hypothesis:
H1: Average User Rating and Genre types are not independent.

In [26]:
import scipy.stats

table1 = pd.crosstab(index=game_genres["Average User Rating"],columns=game_genres["Genre"])
chi2, p, ddof, expected = scipy.stats.chi2_contingency(table1)
table1
msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) )
Test Statistic: 460.4332656032267
p-value: 1.3849598083268897e-22
Degrees of Freedom: 200

Reject H0 (p-value ≈ 1.4e-22).
We have strong statistical evidence that Average User Rating is not independent of Genre.
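
To make the test above more concrete, the sketch below computes the statistic by hand on a hypothetical 3x2 table. The expected count of a cell under independence is row total x column total / grand total, and the degrees of freedom (stored in the variable named `ddof` in the cell above) are (rows - 1) x (columns - 1). The table values are made up for illustration.

```python
# Hypothetical 3x2 contingency table: rows are rating levels, columns are genres.
observed = [
    [20, 10],
    [30, 30],
    [10, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, dof)
```

The test is only reliable when the expected counts are not too small, so it is worth inspecting the `expected` array returned by `chi2_contingency` for sparse rating and genre combinations.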


Average User Rating and Price

Null Hypothesis:
H0: Average User Rating and Price are independent.

Alternate Hypothesis:
H1: Average User Rating and Price are not independent.

In [27]:
table2 = pd.crosstab(index=game_genres["Average User Rating"],columns=game_genres["Price"])
chi2, p, ddof, expected = scipy.stats.chi2_contingency(table2)

msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) )
Test Statistic: 116.63921697620206
p-value: 0.7549711073736647
Degrees of Freedom: 128

Cannot reject H0 (p-value = 0.755).
There is not enough statistical evidence to conclude that Average User Rating and Price are dependent.


Average User Rating and Age Rating

Null Hypothesis:
H0: Average User Rating and Age Rating are independent.

Alternate Hypothesis:
H1: Average User Rating and Age Rating are not independent.

In [28]:
table2 = pd.crosstab(index=game_genres["Average User Rating"],columns=game_genres["Age Rating"])
chi2, p, ddof, expected = scipy.stats.chi2_contingency(table2)

msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) )
Test Statistic: 108.17691535859234
p-value: 1.1721692983359176e-12
Degrees of Freedom: 24

Reject H0 (p-value ≈ 1.2e-12).
There is strong statistical evidence that Average User Rating and Age Rating are not independent.

Z-Test

Average user rating test between board and puzzle games:

Null Hypothesis:
H0: The average user rating of puzzle games is less than or equal to that of board games.

Alternate Hypothesis:
H1: The average user rating of puzzle games is higher than that of board games.

In [29]:
from statsmodels.stats.weightstats import ztest

popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]

Puzzle = game_genres[game_genres['Genre'] == 'Puzzle']['Average User Rating']
Board = game_genres[game_genres['Genre'] == 'Board']['Average User Rating']

ztest(Puzzle,Board,alternative = 'larger')
Out[29]:
(7.712338086770961, 6.17666762622181e-15)

There is strong statistical evidence that the average user rating of puzzle games is higher than that of board games (p-value ≈ 0.000).
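
The one-sided test above can be reproduced by hand. The sketch below assumes a pooled variance estimate and uses hypothetical ratings; the function name `two_sample_z` is an illustration and not part of statsmodels.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z(x1, x2):
    """One-sided (larger) two-sample z-test with a pooled variance estimate."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (mean(x1) - mean(x2)) / std_err
    # Upper-tail probability: chance of a z at least this large if H0 holds.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical ratings, not taken from the dataset.
puzzle = [4.5, 4.5, 5.0, 4.0, 4.5, 5.0]
board = [3.5, 4.0, 3.0, 4.5, 3.5, 4.0]
z, p = two_sample_z(puzzle, board)
print(z, p)
```

A z-test relies on large samples; with 1337 puzzle games and 829 board games in the data this assumption looks reasonable, whereas the six-value samples here only illustrate the arithmetic.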


App size test between board and puzzle games:

Null Hypothesis:
H0: The size of puzzle games is greater than or equal to that of board games.

Alternate Hypothesis:
H1: The size of puzzle games is smaller than that of board games.

In [30]:
popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]

Puzzle = game_genres[game_genres['Genre'] == 'Puzzle']['Size']
Board = game_genres[game_genres['Genre'] == 'Board']['Size']

ztest(Puzzle,Board,alternative = 'smaller')
Out[30]:
(-4.85790299099606, 5.931775266610638e-07)

There is strong statistical evidence that puzzle games are smaller in size than board games (p-value ≈ 5.9e-07).


Average user rating test between 9+ and 17+ games:

Null Hypothesis:
H0: The average user rating of 17+ games is greater than or equal to that of 9+ games.

Alternate Hypothesis:
H1: The average user rating of 17+ games is lower than that of 9+ games.

In [31]:
popular_genres= ["Puzzle","Simulation","Action","Board", "Role Playing"]

# Average user ratings grouped by age rating
rating_17 = game_genres[game_genres['Age Rating'] == '17+']['Average User Rating']
rating_9 = game_genres[game_genres['Age Rating'] == '9+']['Average User Rating']

ztest(rating_17,rating_9,alternative = 'smaller')
Out[31]:
(-3.2884533799145403, 0.0005036973358100693)

There is strong statistical evidence that the average user rating of 17+ games is lower than that of 9+ games (p-value ≈ 0.0005).

Regression

In [32]:
plt.figure(figsize=(18,10), dpi= 100)
ax = sns.regplot(data=game_genres_ds, x='Size', y='Average User Rating', color = 'darkred')
ax.set_ylabel('Average User Rating', fontsize = 20)
ax.set_xlabel('Size', fontsize = 20)
plt.show()

As the graph shows, there is a weak positive relationship between size and average user rating.


We know that the correlations between the numeric variables and average user rating are very low, and that the rating depends on categorical variables such as genre and age rating. Nevertheless, I tried a regression analysis with average user rating as the dependent variable.

For this purpose, the following model was built for testing:
y(average rating) = B0 + B1*(user rating count) + B2*(price) + B3*(size)

In [33]:
import statsmodels.api as sm
X = gameanalyse[['User Rating Count','Price','Size']] # three predictors for the multiple regression; add or remove column names within the brackets to change the model
Y = gameanalyse['Average User Rating']

X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)
                             OLS Regression Results                            
===============================================================================
Dep. Variable:     Average User Rating   R-squared:                       0.005
Model:                             OLS   Adj. R-squared:                  0.004
Method:                  Least Squares   F-statistic:                     12.26
Date:                 Sat, 04 Jan 2020   Prob (F-statistic):           5.33e-08
Time:                         12:31:37   Log-Likelihood:                -8548.9
No. Observations:                 7561   AIC:                         1.711e+04
Df Residuals:                     7557   BIC:                         1.713e+04
Df Model:                            3                                         
Covariance Type:             nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 4.0332      0.010    398.774      0.000       4.013       4.053
User Rating Count  5.427e-07   2.04e-07      2.661      0.008    1.43e-07    9.42e-07
Price                -0.0032      0.004     -0.887      0.375      -0.010       0.004
Size               1.832e-10   3.43e-11      5.339      0.000    1.16e-10     2.5e-10
==============================================================================
Omnibus:                     1276.571   Durbin-Watson:                   1.792
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2090.124
Skew:                          -1.142   Prob(JB):                         0.00
Kurtosis:                       4.192   Cond. No.                     3.48e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.48e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

As the results show, the adjusted R-squared value (0.004) is too low to validate this model. Thus, the relationship cannot be explained by a linear model with these variables.
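
The large condition number in warning [2] is likely related to the scale of Size, which is measured in bytes and sits next to a constant column of ones. The sketch below fits the same kind of model with plain NumPy on hypothetical data, standardising the predictor first, and computes R-squared as 1 minus the ratio of residual variance to total variance. The generated data and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a rating that barely depends on the predictor.
size = rng.uniform(1e6, 4e8, 500)
rating = 4.0 + 2e-10 * size + rng.normal(0, 0.75, 500)

# Standardise the predictor so the design matrix is well conditioned.
size_std = (size - size.mean()) / size.std()
X = np.column_stack([np.ones_like(size_std), size_std])

# Ordinary least squares via a least-squares solve.
coef, *_ = np.linalg.lstsq(X, rating, rcond=None)
residuals = rating - X @ coef

# With an intercept the residuals have zero mean, so this equals 1 - SSR/SST.
r_squared = 1 - residuals.var() / rating.var()
print(coef, r_squared)
```

Standardising changes the units of the coefficients but not the fitted values, so the R-squared stays just as low; it only removes the numerical warning.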

5. Final Discussion

Best Strategy Games Overall

In [34]:
gameanalyse.sort_values(by=['Average User Rating', 'User Rating Count'], ascending=False)[['Name', 'Average User Rating', 'User Rating Count', 'Size', 'Price', 'Developer','Genres']].head(10)
Out[34]:
Name Average User Rating User Rating Count Size Price Developer Genres
6110 Cash, Inc. Fame & Fortune Game 5.0 374772.0 245957632.0 0.00 Lion Studios Games, Strategy, Entertainment, Simulation
3884 Egg, Inc. 5.0 174591.0 74891264.0 0.00 Auxbrain, Inc. Games, Strategy, Simulation
6704 AFK Arena 5.0 156766.0 225711104.0 0.00 Lilith Games Games, Strategy, Role Playing
4805 South Park: Phone Destroyer™ 5.0 156044.0 130186240.0 0.00 Ubisoft Games, Card, Strategy
6382 From Zero to Hero: Cityman 5.0 146729.0 296638464.0 0.00 Heatherglade Ltd Games, Finance, Strategy, Simulation
7101 Sushi Bar Idle 5.0 123606.0 257325056.0 0.00 Green Panda Games Games, Simulation, Entertainment, Strategy
5451 Fire Emblem Heroes 5.0 120283.0 175634432.0 0.00 Nintendo Co., Ltd. Games, Strategy, Role Playing
1428 Bloons TD 5 5.0 97776.0 133326848.0 2.99 Ninja Kiwi Games, Action, Entertainment, Strategy
787 Naval Warfare 5.0 90214.0 43198464.0 0.00 Untapped Games, Strategy, Board, Social Networking
7452 Idle Roller Coaster 5.0 88855.0 234342400.0 0.00 Green Panda Games Games, Simulation, Strategy, Entertainment

Most Reviewed Games

In [35]:
gameanalyse.sort_values(by=['User Rating Count'], ascending=False)[['Name', 'Average User Rating', 'User Rating Count', 'Size','Price', 'Developer','Genres']].head(10)
Out[35]:
Name Average User Rating User Rating Count Size Price Developer Genres
1210 Clash of Clans 4.5 3032734.0 1.612196e+08 0.0 Supercell Games, Action, Entertainment, Strategy
4313 Clash Royale 4.5 1277095.0 1.451080e+08 0.0 Supercell Games, Strategy, Entertainment, Action
6443 PUBG MOBILE 4.5 711409.0 2.384082e+09 0.0 Tencent Mobile International Limited Games, Action, Strategy
1641 Plants vs. Zombies™ 2 4.5 469562.0 1.207634e+08 0.0 PopCap Games, Strategy, Entertainment, Adventure
4704 Pokémon GO 3.5 439776.0 2.815212e+08 0.0 Niantic, Inc. Games, Strategy, Role Playing, Health & Fitness
2012 Boom Beach 4.5 400787.0 2.027858e+08 0.0 Supercell Games, Strategy, Action
6110 Cash, Inc. Fame & Fortune Game 5.0 374772.0 2.459576e+08 0.0 Lion Studios Games, Strategy, Entertainment, Simulation
4884 Idle Miner Tycoon: Cash Empire 4.5 283035.0 4.439747e+08 0.0 Kolibri Games GmbH Games, Simulation, Strategy, Entertainment
35 TapDefense 3.5 273687.0 7.774384e+06 0.0 TapJoy Games, Strategy, Entertainment, Simulation
2719 Star Wars™: Commander 4.5 259030.0 1.230838e+08 0.0 NaturalMotion Games, Entertainment, Action, Strategy

Conclusion

From the descriptive and inferential analysis, we found that:

  • Game sizes increased over the years.
  • Average user rating does not appear to depend on app price, but it does depend on genre and age rating.
  • Puzzle is the most popular of the selected sub-genres. Puzzle games have the lowest mean app size, the lowest mean price and the highest mean user rating.
  • Role Playing games generally have larger app sizes than other games.
  • Action games tend to have higher user rating counts.
  • The mean price of board games is considerably higher than that of other games.
  • Games with a 9+ age rating tend to have higher average user ratings.
  • Correlations between Size, Average User Rating, User Rating Count and Price are low. As we saw in the regression section, these variables cannot explain Average User Rating. For further analysis, a more complex model that includes categorical variables could be built to predict Average User Rating.

6. References